ABSTRACT

The Kolmogorov-Smirnov goodness-of-fit test is exact only
when the hypothesized distribution is continuous, but recently
Conover has extended the Kolmogorov-Smirnov test to obtain a
test that is exact in the case of discrete distributions.
Reasons for using this procedure instead of the regular 
Kolmogorov-Smirnov test when the hypothesized distribution 
is discrete are given. A computer subroutine is developed 
to allow easy use of the procedure. The subroutine is then 
used to demonstrate the conservatism of the regular Kolmogorov- 
Smirnov test in this case and to investigate some properties 
of the asymptotic distributions of the test statistics. 
I. INTRODUCTION



Various statistical problems reduce to the choice of a
parametric form of a probability distribution of a population.
A one-sample goodness-of-fit test is a test of the hypothesis
$H_0$: $F(x) = H(x)$ for all x, where F is the unknown cumulative
distribution function of the population in question and H is
the hypothesized cumulative distribution function. There are
various test statistics that can be used in goodness-of-fit
tests. The choice of which statistic to use depends on the
nature of the sample, whether F is continuous or discrete,
whether all of the parameters of H are known or are estimated
from the sample, or whether H is a member of a certain class
of distributions. The two most commonly used tests are the
Chi-square and Kolmogorov-Smirnov (K-S) type goodness-of-fit
tests.

The Chi-square test is based on a test statistic that is
asymptotically distributed as a Chi-square random variable,
and therefore is used when the sample size is relatively large.
The Chi-square test does not require major assumptions on the
hypothesized distribution and can be used when the parameters
of the hypothesized distribution are estimated from the sample.
The hypothesized distribution may be either discrete or
continuous and the data may be observations of the population or
grouped observations of the population.
The Kolmogorov-Smirnov test statistic has a known distribution
for all sample sizes, which makes the test exact. The
K-S test may be preferred to the Chi-square test when the sample
size is small because of the exactness of the K-S test. There
is some controversy as to which of the two tests is more
powerful. The relative power has been studied (see Massey [7])
and the K-S test appears to be more powerful in some cases
while the Chi-square test is more powerful in others.
Traditionally, a major requirement for the K-S test has been that
the hypothesized distribution, H, must be continuous. If H
is not continuous, then a test of the hypothesis $H_0$ using the
traditional K-S tables is known to be conservative (see Noether
[9]).

Unfortunately, the exact degree of conservatism is not
known. W. J. Conover [3] derived a method to use a K-S type
test when the hypothesized distribution is discrete or when
the data have already been grouped (see Darmosiswoys [5]),
but the computations using this method are long and involved.
In what follows, a program is developed to be used on a digital
computer employing Conover's method. This program is then used
to investigate the asymptotic distributions of the test
statistics.

A description of notation used herein is contained in the 
following list: 
Notation                                Description

$S_n$                                   Empirical distribution function of a
                                        random sample of size n.

n                                       Sample size.

$\alpha$                                Level of significance of test.

$\hat{\alpha}$                          Critical level of test.

F                                       Unknown distribution function of a
                                        random sample.

H                                       Hypothesized distribution function.

$X_1, X_2, \ldots, X_n$                 Random sample of size n.

$X_{(1)}, X_{(2)}, \ldots, X_{(n)}$     Ordered rearrangement of the random
                                        sample $X_1, \ldots, X_n$ in ascending order.

$H_0$                                   A null hypothesis in test hypotheses.

$H_1$                                   An alternate hypothesis in test
                                        hypotheses.
II. DESCRIPTION OF CONOVER'S PROCEDURE

A. KOLMOGOROV-SMIRNOV TYPE TESTS AND TEST STATISTICS

One-sample K-S type tests are goodness-of-fit tests that
compare the empirical cumulative distribution function of a
random sample to a hypothesized cumulative distribution
function. If the empirical cumulative distribution function
is not close, in the sup norm sense, to the hypothesized
cumulative distribution function, then the conclusion is
made that the random sample did not come from the hypothesized
distribution.



Let $X_1, X_2, \ldots, X_n$ be independent random variables
(observations) each having the same unknown distribution F. If
$X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ represents the rearrangement of
$X_1, \ldots, X_n$ in ascending order, then the empirical cumulative
distribution function $S_n$ is defined by:

$$S_n(x) = \begin{cases}
0 & \text{if } x < X_{(1)} \\
k/n & \text{if } X_{(k)} \le x < X_{(k+1)},\quad k = 1, 2, \ldots, n-1 \\
1 & \text{if } x \ge X_{(n)}
\end{cases}$$

The K-S test may be used to test the three following sets of
hypotheses:

1. $H_0$: $F(x) = H(x)$ for all x
   $H_1$: $F(x) \ne H(x)$ for some x

2. $H_0$: $F(x) \ge H(x)$ for all x
   $H_1$: $F(x) < H(x)$ for some x

3. $H_0$: $F(x) \le H(x)$ for all x
   $H_1$: $F(x) > H(x)$ for some x

In each hypothesis, H is a specified distribution function. 

One of the following test statistics is used depending on
the hypotheses being tested:

1. $D = \sup_x |H(x) - S_n(x)|$

2. $D^- = \sup_x (H(x) - S_n(x))$

3. $D^+ = \sup_x (S_n(x) - H(x))$

For each of the three hypotheses, a sufficiently large
observation of the test statistic indicates that the null hypothesis
should be rejected. If $\alpha$ is the level of significance desired
in the test of either hypotheses 1, 2, or 3, then critical
values $c$, $c^-$, or $c^+$ are determined as follows, according to
which set of hypotheses is being tested:

1. $P(D \ge c) = \alpha$

2. $P(D^- \ge c^-) = \alpha$

3. $P(D^+ \ge c^+) = \alpha$

"P" in the above equations is the measure associated with H.
If the observation $d$, $d^-$, or $d^+$ of the statistics $D$, $D^-$, or $D^+$,
respectively, exceeds the corresponding critical value, that
null hypothesis is rejected at a level of significance of $\alpha$.
Instead of determining the critical values, we may compute
the critical level, $\hat{\alpha}$, which is the smallest significance
level at which the null hypothesis would be rejected for the
given observation $d$, $d^-$, or $d^+$, and compare it with $\alpha$. If
$\hat{\alpha} \le \alpha$, then the null hypothesis is rejected, while if
$\hat{\alpha} > \alpha$, the null hypothesis is not rejected. The two methods are
equivalent and the level of significance in both is $\alpha$.
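The three statistics are simple to evaluate when H is purely discrete, because
H and $S_n$ are both step functions that jump only at mass points. The
following Python sketch is an illustration only and is not part of subroutine
DISKS. It assumes that every observation is one of the listed mass points; the
function name ks_statistics and its arguments are hypothetical. The data are
the ten observations and the discrete uniform distribution on 1, ..., 5 used as
the first verification example in Appendix A.

```python
from collections import Counter

def ks_statistics(sample, mass_points, cdf_values):
    """Return (d, d_minus, d_plus) for a sample tested against a discrete H.

    mass_points: sorted support of H; cdf_values: H at each mass point.
    Assumption: every sample value is one of the mass points.
    """
    n = len(sample)
    counts = Counter(sample)
    # Left of the first mass point S_n = H = 0, so both suprema start at 0.
    d_minus = 0.0
    d_plus = 0.0
    cumulative = 0
    for point, h in zip(mass_points, cdf_values):
        cumulative += counts.get(point, 0)
        s = cumulative / n              # S_n at this mass point
        d_minus = max(d_minus, h - s)   # sup of H - S_n
        d_plus = max(d_plus, s - h)     # sup of S_n - H
    return max(d_minus, d_plus), d_minus, d_plus

sample = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
points = [1, 2, 3, 4, 5]
H = [0.2, 0.4, 0.6, 0.8, 1.0]
d, d_minus, d_plus = ks_statistics(sample, points, H)
print(d, d_minus, d_plus)
```

Evaluating only at the mass points is sufficient here because neither function
changes between them.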

If $H_0$ is true and H is continuous, it is known (see
Darling [4]) that the distributions of $D$, $D^-$, and $D^+$ are
independent of H. Tables of critical values for various
levels of significance of the test statistics $D$, $D^-$, and $D^+$
are available for use in the K-S test when H is continuous.
When H is discrete, the distributions of $D$, $D^-$, and $D^+$ are
not independent of H and the standard K-S tables cannot be
used to find the critical levels of the test statistics. When
H is discrete, the standard K-S tables can still be used to give
a conservative approximation of the level of significance of the
test, as the following demonstration shows. Let Y be a discrete
random variable with distribution function R. If $a_1, a_2, \ldots$ are
points of discontinuity of R with associated probabilities
$p_1, p_2, \ldots$, then let Z be any continuous random variable with
distribution function T such that $T(a_i) - T(a_{i-1}) = p_i$,
$i = 1, 2, \ldots$, where $a_0$ is any point such that $a_0 < a_1$ and
$T(a_0) = 0$. Then

$$R(a_i) = T(a_i), \quad i = 1, 2, \ldots \tag{1}$$

Let $Y_1, Y_2, \ldots, Y_n$ be a random sample from R. This random
sample can be thought of as having been determined by a random
sample $Z_1, Z_2, \ldots, Z_n$ from T by setting $Y_k = a_i$ if
$a_{i-1} < Z_k \le a_i$, $i = 1, 2, \ldots$, $k = 1, 2, \ldots, n$. If $R_n$ is the
empirical distribution function of $Y_1, Y_2, \ldots, Y_n$ and $T_n$ is the
empirical distribution function of $Z_1, Z_2, \ldots, Z_n$, then

$$R_n(a_i) = T_n(a_i), \quad i = 1, 2, \ldots \tag{2}$$

Let $D' = \sup_a |R_n(a) - R(a)|$. Since R is discrete,

$$D' = \sup_i |R_n(a_i) - R(a_i)|. \tag{3}$$

(1) and (2) imply $R_n(a_i) - R(a_i) = T_n(a_i) - T(a_i)$ for all
$i = 1, 2, \ldots$. Then,

$$D' = \sup_i |R_n(a_i) - R(a_i)| = \sup_i |T_n(a_i) - T(a_i)| \le \sup_a |T_n(a) - T(a)| = D,$$

which implies $P(D' \ge c) \le P(D \ge c)$ for any c. The same
argument can be used for $D^-$ and $D^+$ to show that
$P(D^{-\prime} \ge c) \le P(D^- \ge c)$ and $P(D^{+\prime} \ge c) \le P(D^+ \ge c)$.
Therefore, if the standard tables are used to construct a test
when H is discrete, the test is conservative.
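The coupling used in this demonstration can be checked numerically. The
Python sketch below is an illustration under stated assumptions, not part of
the original study: T is taken to be the uniform distribution on (0, 1), the
mass points are $a_i = i/k$ with k = 5, and the function names are
hypothetical. Each continuous sample is grouped onto the mass points and the
two statistics are compared sample by sample.

```python
import math
import random

def continuous_d(z):
    """D = sup_x |T_n(x) - T(x)| for T uniform on (0, 1)."""
    z = sorted(z)
    n = len(z)
    # The supremum is attained at an order statistic or just before it.
    return max(max((i + 1) / n - v, v - i / n) for i, v in enumerate(z))

def discrete_d(z, k):
    """D' = sup_i |R_n(a_i) - R(a_i)| after grouping z onto a_i = i/k."""
    n = len(z)
    # Y = a_i exactly when a_{i-1} < Z <= a_i; store the index i of each Y.
    cells = [max(1, math.ceil(v * k)) for v in z]
    return max(abs(sum(c <= i for c in cells) / n - i / k)
               for i in range(1, k + 1))

random.seed(1)
pairs = []
for _ in range(1000):
    z = [random.random() for _ in range(20)]
    pairs.append((discrete_d(z, 5), continuous_d(z)))

# Count samples in which the grouped statistic exceeds the continuous one.
violations = sum(dp > dc + 1e-12 for dp, dc in pairs)
print("samples with D' > D:", violations)
```

Because D' is a supremum over a subset of the points entering D, no sample
should violate the inequality, which is what the count checks.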

Slakter [10] demonstrates the conservatism of the continuous
K-S test when H is discrete using a computer simulation
to calculate an estimate of the actual level of significance,
$\alpha_k$, of the test of the hypothesis $H_0$ where H is the discrete
uniform distribution with k mass points. Ten thousand random samples
were generated from the hypothesized distribution and the
statistic D was evaluated. $\alpha_k$ was then estimated as the
proportion of the ten thousand replications in which $H_0$ was
rejected. This process was repeated for various sample sizes
and various k, and in all cases $\alpha_k$ was considerably less than
the nominal $\alpha$. For example, with k = 10, 50 observations,
and $\alpha = .05$, $\alpha_k$ turned out to be .0166.
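A simulation of this kind can be sketched in a few lines of Python. This is
not Slakter's program: the critical value is taken as the large-sample
approximation $1.36/\sqrt{n}$ rather than an exact tabled value, the random
number generator and seed are arbitrary, and the function names are
hypothetical.

```python
import math
import random

def d_statistic(sample, k):
    """Two-sided K-S statistic against the discrete uniform on 1..k."""
    n = len(sample)
    counts = [0] * (k + 1)
    for v in sample:
        counts[v] += 1
    d, cumulative = 0.0, 0
    for i in range(1, k + 1):
        cumulative += counts[i]
        d = max(d, abs(cumulative / n - i / k))
    return d

def estimated_size(k, n, critical_value, replications, seed=0):
    """Proportion of samples drawn under H_0 that the test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(replications):
        sample = [rng.randint(1, k) for _ in range(n)]
        if d_statistic(sample, k) >= critical_value:
            rejections += 1
    return rejections / replications

n, k = 50, 10
# Assumption: large-sample approximation to the tabled .05 critical value.
critical = 1.36 / math.sqrt(n)
alpha_k = estimated_size(k, n, critical, replications=10000)
print(alpha_k)
```

The estimate varies with the seed, but it should fall well below the nominal
.05, in line with the conservatism described above.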

The use of a conservative test might at first seem desir- 
able since it guarantees that the actual probability of 
rejecting the hypothesis when it is true is less than the 
predetermined probability of rejecting a true hypothesis. 
Unfortunately, this causes a decrease in the power of the 
test. This unknown amount of decrease in the power of the 
test leads us to desire that we could calculate the exact 
significance level of our test when H is discrete. 

Since the distributions of $D$, $D^-$, and $D^+$ depend on H it
would require a prohibitive number of tables for use in
testing $H_0$ when H is discrete, even for simple distribution
families. For this reason, the use of K-S tests when H is 
discrete has not been investigated until recently when W. J. 
Conover demonstrated a method for finding the exact critical 
level (approximate in the two-sided case) in this instance. 

The program presented in this thesis makes use of Conover's 
procedure a practical reality. 

B. CONOVER'S PROCEDURE 

1. Distributions of Test Statistics

Conover derives the distributions of $D$, $D^-$, and $D^+$ for
H continuous or discontinuous in [3]. He shows that
$P(D^+ \ge t) = 1 - e_{n+1}$, where the $e_k$'s are defined recursively as
follows: $e_1 = 1$ and for $k = 2, 3, \ldots, n+1$

$$e_k = 1 - \sum_{j=1}^{k-1} \binom{k-1}{j-1} f_j^{\,k-j} e_j \tag{4}$$

with

$$f_k = P\left\{X_i < H^{-1}\left(1 - \frac{k-1}{n} - t\right)\right\}, \quad 1 \le k \le n+1 \tag{5}$$

The $X_i$'s are the independent identically distributed random
variables with distribution function F. $H^{-1}(p)$ is defined as
$\sup\{x: H(x) \le p\}$ for $0 < p \le 1$ and as minus infinity if
$p \le 0$. If H is continuous, then with the use of the
probability integral transform, it is easy to see that
$f_k = 1 - \frac{k-1}{n} - t$ and (4) reduces to the form of the regular
K-S statistic obtained by Birnbaum and Tingey [2]. We note
that if $k > n(1-t)+1$, then from (5), $f_k = 0$ and the
distribution of $D^+$ becomes

$$P(D^+ \ge t) = \sum_{j=1}^{m} \binom{n}{j-1} f_j^{\,n-j+1} e_j \tag{6}$$

where m is the greatest integer in $n(1-t)+1$. The
distribution of $D^-$ is very similar to that of $D^+$ and is given by
$P(D^- \ge t) = 1 - b_{n+1}$, where the $b_k$'s are defined recursively as
follows: $b_1 = 1$ and for $k = 2, 3, \ldots, n+1$

$$b_k = 1 - \sum_{j=1}^{k-1} \binom{k-1}{j-1} c_j^{\,k-j} b_j \tag{7}$$

with

$$c_k = P\left\{X_i > H^{-1}\left(\frac{k-1}{n} + t\right)\right\}, \quad 1 \le k \le n+1 \tag{8}$$

If $k > n(1-t)+1$, then $\frac{k-1}{n} + t > 1$ in (8), which implies
$c_k = 0$, and the distribution of $D^-$ becomes

$$P(D^- \ge t) = \sum_{j=1}^{m} \binom{n}{j-1} b_j c_j^{\,n-j+1} \tag{9}$$

$P(D \ge t)$ is approximated by $P(D \ge t) \approx P(D^+ \ge t) + P(D^- \ge t)$
and the following bounds for $P(D \ge t)$ are given:

$$P(D^+ \ge t) + P(D^- \ge t) - P(D^+ \ge t)\,P(D^- \ge t) \le P(D \ge t) \le P(D^+ \ge t) + P(D^- \ge t) \tag{10}$$

In most tests, $P(D^+ \ge t)$ and $P(D^- \ge t)$ are small and therefore
the maximum error in this approximation is very small.



2. Calculation of Critical Levels

a. Critical Level for $D^-$

Let $d^- = \sup_x (H(x) - S_n(x))$ be determined from
the observations. For each k such that $1 \le k \le n(1-d^-)+1$,
draw a horizontal line with ordinate value of $\frac{k-1}{n} + d^-$ on
the graph of H. $c_k$ is then $1 - \left(\frac{k-1}{n} + d^-\right)$ unless the line
intersects H at a discontinuity, in which case $c_k$ is one minus
the height of H at the top of the jump. The $b_k$'s are then
computed from (7), and (9) is used to compute the critical
level, $P(D^- \ge d^-)$.

b. Critical Level for $D^+$

Let $d^+ = \sup_x (S_n(x) - H(x))$ be determined from
the observations. For each k such that $1 \le k \le n(1-d^+)+1$,
draw a horizontal line with ordinate value of $1 - \left(\frac{k-1}{n} + d^+\right)$
on the graph of H. $f_k$ is then this ordinate value unless the
line intersects the graph of H at a discontinuity of H, in
which case $f_k$ is equal to the height of H at the bottom of
the jump. The $e_k$'s are computed using (4), and (6) is used
to compute the critical level, $P(D^+ \ge d^+)$.

c. Critical Level for D

Let $d = \sup_x |H(x) - S_n(x)|$ be determined from
the observations. $P(D^- \ge d)$ and $P(D^+ \ge d)$ are computed using
(9) and (6) as described above, and (10) is used to put bounds
on the critical level, $P(D \ge d)$.
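The graphical procedure above can be sketched in Python as follows. This is
an illustrative reading of the procedure and not a transcription of
subroutine DISKS. It indexes the terms from 0 instead of 1, it assumes that H
is purely discrete and is supplied as the sorted list of its values at the
mass points (ending in 1), and the names conover_levels, step_down, step_up,
and EPS are hypothetical.

```python
from math import comb, floor

EPS = 1e-9  # tolerance when comparing a line height with a value of H

def step_down(cdf_values, height):
    """Largest value of H (or 0) not above the line: bottom of the jump."""
    return max((h for h in cdf_values if h <= height + EPS), default=0.0)

def step_up(cdf_values, height):
    """Smallest value of H not below the line: top of the jump."""
    return min((h for h in cdf_values if h >= height - EPS), default=1.0)

def critical_level(probs, n):
    """Recursion (4)/(7) and sum (6)/(9); probs[j] plays the role of f or c."""
    e = []
    total = 0.0
    for k, p in enumerate(probs):
        e_k = 1.0 - sum(comb(k, j) * probs[j] ** (k - j) * e[j]
                        for j in range(k))
        e.append(e_k)
        total += comb(n, k) * p ** (n - k) * e_k
    return total

def conover_levels(cdf_values, n, t):
    """Return (P(D- >= t), P(D+ >= t), lower bound, upper bound) as in (10)."""
    m = floor(n * (1 - t) + EPS) + 1          # number of non-trivial terms
    f = [step_down(cdf_values, 1 - t - j / n) for j in range(m)]
    c = [1 - step_up(cdf_values, j / n + t) for j in range(m)]
    p_plus = critical_level(f, n)
    p_minus = critical_level(c, n)
    upper = p_plus + p_minus
    return p_minus, p_plus, upper - p_plus * p_minus, upper

# Discrete uniform on five mass points, n = 10, observed d = 0.4.
p_minus, p_plus, lower, upper = conover_levels([0.2, 0.4, 0.6, 0.8, 1.0], 10, 0.4)
print(p_plus, lower, upper)
```

For these inputs the sketch can be compared with the values reported for the
first verification example in Appendix A (.020809 for $D^+$ and bounds of
.041184 and .041617 for D).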



C. SUBROUTINE "DISKS"

The calculations of critical levels as described above 
can be very time consuming, especially as the number of 
observations increases. For this reason, subroutine DISKS 
(Appendix A) was developed to perform these calculations. 
Subroutine DISKS will calculate the critical levels of equa- 
tions (6) and (9) and the bounds on the critical level of D 
as in (10) for most discrete distributions (see Appendix A 
for restrictions) . Subroutine DISKS was used to calculate 
critical levels for various examples and verified with cal- 
culations of the critical levels made by hand. 

Subroutine DISKS can be modified slightly to calculate 
the exact size of a critical region for a test. For example, 
with a sample of size 10, the critical region determined from 
the standard tables for continuous distributions of size .1 
consists of all values of D greater than .369. By inserting
the value of .369 for d in a modified version of DISKS
and the hypothesized distribution H, the exact size of the
test when H is discontinuous (which we know is less than .1)
can be calculated.
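For small problems the exact size can also be obtained by direct enumeration,
which gives an independent check on the recursive calculation. The Python
sketch below does not use Conover's recursion or DISKS. It assumes the
hypothesized distribution is the discrete uniform on m mass points, uses exact
rational arithmetic, and sums the multinomial probability of every
configuration of counts whose statistic D reaches the critical value. The
function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial

def compositions(n, m):
    """All ways to split n observations among m mass points (stars and bars)."""
    for bars in combinations(range(n + m - 1), m - 1):
        previous = -1
        counts = []
        for b in bars:
            counts.append(b - previous - 1)
            previous = b
        counts.append(n + m - 2 - previous)
        yield counts

def exact_size(m, n, critical_value):
    """P(D >= critical_value) under a discrete uniform H with m mass points."""
    size = Fraction(0)
    for counts in compositions(n, m):
        cumulative, d = 0, Fraction(0)
        for i, c in enumerate(counts, start=1):
            cumulative += c
            d = max(d, abs(Fraction(cumulative, n) - Fraction(i, m)))
        if d >= critical_value:
            ways = factorial(n)               # multinomial coefficient
            for c in counts:
                ways //= factorial(c)
            size += Fraction(ways, m ** n)
    return size

# Sample of size 10, tabled critical value .369, five equally likely points.
size = exact_size(5, 10, Fraction(369, 1000))
print(float(size))
```

With five mass points and n = 10 the statistic can only take multiples of .1,
so the region D >= .369 coincides with D >= .4 and the exact size is far below
the nominal .1.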
III. ASYMPTOTIC DISTRIBUTIONS OF TEST STATISTICS



A. ANALYTICAL DISTRIBUTIONS 

The asymptotic distributions of $D^-$, $D^+$, and $D$ have been
studied by several people for the case when H is not continuous.
Schmid [8] showed that the limiting distributions of
$D^-$, $D^+$, and $D$ do exist, but are no longer independent of H.
The limiting distributions depend on the values of H at the
discontinuity points. Schmid showed, for example, that if
H is discontinuous at $x = x_i$, $i = 1, 2, \ldots, c$, with
$H(x_i - 0) = f_{2i-1}$, $H(x_i) = f_{2i}$, and $f_{2c+1} = 1$, then

$$\lim_{n \to \infty} P\left(D < \frac{k}{\sqrt{n}}\right) = G(k),$$

where G(k) is a multiple series, summed over integers
$\rho_1, \ldots, \rho_c$ from $-\infty$ to $\infty$, of 2c-fold integrals of the form

$$b \int \cdots \int \exp\left[-\frac{1}{2} \sum_{j,m=1}^{2c} a_{jm} x_j x_m\right] dx_1 \cdots dx_{2c}.$$

Here $a_{ij} = 0$ for $i < j-1$ or $i > j+1$. The remaining coefficients
$a_{jm}$, the constant b, and the region of integration depend on k
and on the values $f_j$; their explicit forms are given in Schmid [8].
Unfortunately, G(k) becomes undefined when H is discrete
since the a's blow up and b becomes zero. Conover [3]
tried, as did this author, using the distributions of Section II
to derive the asymptotic distributions, but the attempts were
unsuccessful. For these reasons, a computer routine using
subroutine DISKS was used to investigate the asymptotic
properties of the distributions of $D^-$, $D^+$, and $D$. Since formulations
in the literature of the limiting distributions involve
multiples of the inverse of the square root of the sample size, it
was decided that values of k would be determined such that
$\lim_{n \to \infty} P(D \ge k/\sqrt{n}) \approx \alpha$ for various values of $\alpha$. The
asymptotic distributions of $D^+$ and $D^-$ were not studied since they
display the same basic characteristics as the asymptotic distribution
of D.

B. COMPUTER PROGRAM USED 

Subroutine DISKS was modified to search for the value of
k such that $P(D \ge k/\sqrt{n})$ was as close to, but always less than,
a predetermined value of $\alpha$ as possible. Values of n between
thirty and one hundred in increments of five were used to
determine k such that $P(D \ge k/\sqrt{n}) = \alpha$ from (10). Values of n
between eighty-five and one hundred were sometimes not used
since significant errors in calculations occurred, even with
double precision calculations.

The modified subroutine was used to investigate the
asymptotic distribution of D when H was one of the following
distributions:

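1. Discrete uniform with parameter m:

$$H(x) = \begin{cases}
0 & \text{if } x < 1 \\
k/m & \text{if } k \le x < k+1,\quad k = 1, 2, \ldots, m-1 \\
1 & \text{if } x \ge m
\end{cases}$$

2. Poisson with parameter $\mu$:

$$H(x) = \sum_{k=0}^{[x]} \frac{e^{-\mu} \mu^k}{k!}$$

where [x] = largest integer $\le x$

3. Geometric with parameter p:

$$H(x) = \sum_{k=1}^{[x]} p(1-p)^{k-1}$$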
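For reference, the three hypothesized distribution functions can be written
directly from these definitions. The Python sketch below is an illustration
only; the function names are hypothetical, and the geometric sum is evaluated
through its closed form $1 - (1-p)^{[x]}$.

```python
import math

def uniform_cdf(x, m):
    """Discrete uniform on 1..m: H(x) = [x]/m, capped at 1."""
    if x < 1:
        return 0.0
    return min(math.floor(x), m) / m

def poisson_cdf(x, mu):
    """Poisson with parameter mu: sum of e^-mu mu^k / k! for k = 0..[x]."""
    if x < 0:
        return 0.0
    return sum(math.exp(-mu) * mu ** k / math.factorial(k)
               for k in range(math.floor(x) + 1))

def geometric_cdf(x, p):
    """Geometric on 1, 2, ...: sum of p(1-p)^(k-1) for k = 1..[x]."""
    if x < 1:
        return 0.0
    return 1.0 - (1.0 - p) ** math.floor(x)

print(uniform_cdf(2.5, 5), poisson_cdf(1, 0.7), geometric_cdf(2, 0.5))
```

Each function is a right-continuous step function, so its value at a mass
point is the height at the top of the jump there.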

Each distribution was investigated for various values of its 
respective parameter. The values of k determined for the 
various values of n for each particular parametric distribution 
were examined to determine if they appeared to be converging
to some common value. The fact that the distribution of D is 
discrete suggested that the values of k would not converge in 
a uniform manner to some value, but it was hoped that, even 
though it jumped around some, the convergence to a common
value would be evident. By varying the values of the para- 
meters of the various distributions, these discrete distribu- 
tions would approach (in the weak convergence sense) a continu- 
ous distribution and the limiting value of k should approach 
the known limiting values of k for continuous distributions. 

For example, as m in the discrete uniform distribution increased, 
H has smaller and smaller jumps at each mass point and becomes 
"smoother" looking. If we think of the mass points being evenly 
distributed between zero and one, then, as the number of mass 
points increases, H behaves in most respects more and more like 
a continuous uniform distribution function between zero and one. 
Similarly, as the parameter of the Poisson gets larger and 
larger and as the parameter of the geometric gets smaller and 
smaller, these hypothesized cumulative distribution functions 
have smaller and smaller jumps at their points of discontinuity 
and the distribution functions get smoother and smoother. 

Since the usual K-S test is conservative when H is discrete, 
the approximating values of k for the discrete case should be 
always smaller than these known limiting values of k for the 
continuous case. 
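The known limiting values of k for continuous H can be recovered from
Kolmogorov's limiting distribution,
$\lim_{n \to \infty} P(\sqrt{n}\,D \ge k) = 2 \sum_{j=1}^{\infty} (-1)^{j-1} e^{-2 j^2 k^2}$.
The Python sketch below is an illustration and not part of the modified
subroutine; the function names, the number of series terms, and the bisection
interval are assumptions.

```python
import math

def kolmogorov_tail(k, terms=100):
    """Limiting P(sqrt(n) * D >= k) for continuous H (Kolmogorov's series)."""
    return 2.0 * sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * k * k)
                     for j in range(1, terms + 1))

def limiting_k(alpha, lo=0.3, hi=3.0):
    """Solve kolmogorov_tail(k) = alpha by bisection; the tail decreases in k."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if kolmogorov_tail(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

k_05 = limiting_k(0.05)
print(round(k_05, 3))
```

At $\alpha = .05$ the solution is close to 1.36, which is the continuous-case
benchmark against which the discrete values of k are compared.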

C. RESULTS 

For each parametric distribution considered, as n increased,
the sequence of values of k did appear to converge although,
as anticipated, not monotonically. Typical example values
of k determined for various values of n are tabulated below:

     n        k

    30      1.095
    35      1.183
    40      1.107
    45      1.193
    50      1.131
    55      1.146
    60      1.162
    65      1.178
    70      1.165
    75      1.155
    80      1.148
    90      1.160

These values of k were determined for the discrete uniform
distribution with 10 mass points and $\alpha = .05$. The variation
in k as n increases is apparent, but the value of k does appear
to be fairly constant for n greater than 50. As the parameters
of the three distributions were changed and the discrete
distributions became "smoother" looking as described in Section
III.B, the variation in k became less than that in the table above.
In each parametric case that was examined, the values of k for
n > 50 rarely varied from each other by more than .03, as in the
above example. The general tendency was for k to increase as
n increased and then become relatively stable for $n \ge 50$. For
n > 50, the smallest value of k thus obtained was recorded and then
all the values of k for the various values of the parameters
of each distribution were plotted. Figures 1, 2, and 3 show
a smooth curve approximation through the plotted k values for
the three distributions with dotted lines representing the
asymptotic value of k for the continuous case.
Figure 1 shows the values of k for the discrete uniform
distribution for various numbers of mass points. The
conservativeness of the continuous K-S test is readily apparent from
this plot. For example, with twenty mass points the asymptotic
k approximation is 1.16 while in the regular K-S test the
asymptotic value of k is 1.36. As the number of mass points
increases, the value of k increases toward the continuous
K-S value. One of the surprising results is how slowly k
converges to the continuous K-S value. Even with two hundred
mass points at $\alpha = .05$, k = 1.30, which differs from 1.36 by
an amount larger than expected.

Figure 2 depicts the values of k for the Poisson
distribution with various values of the parameter. The curves have
the same general appearance as those in Figure 1 and the same
comments made about the discrete uniform apply here.

Values of k determined for the geometric distribution
with various values of the parameter are plotted in Figure 3.
The curves here are similar to those of the two preceding
distributions, with the apparent convergence of the value of k to the
continuous K-S value of k as the parameter decreases. With this
slight modification, all of the previous comments apply here.
[Figures 1, 2, and 3: plots of k against the distribution parameter. The
horizontal axis of Figure 1 is labeled "Hundreds of Mass Points of Discrete
Uniform".]



IV. SUMMARY AND CONCLUSIONS 



1. The K-S test using the standard tabled critical values 
is conservative when the hypothesized distribution, H, is 
discrete. The test is sometimes substantially conservative 

as indicated in Figures 1, 2, and 3. The power of the test 
is reduced when the test is conservative and, therefore, it 
is desirable to know the exact size of a test instead of a 
conservative estimate. 

2. Conover's procedure can be used to obtain exact (approx- 
imate in the two-sided case) critical levels for a K-S test when 
H is discontinuous or when the data have been grouped. The 
procedure can also be used to find the exact amount of conser- 
vatism of a K-S test if the standard tables are used. The 

only drawbacks to the procedure are the lengthy and tedious 
calculations required. 

3. Subroutine DISKS was developed and tested to calculate
the critical levels in Conover's procedure for many discrete
distributions.

4. As the sample size increases, the limiting distributions
of the test statistics $D$, $D^-$, and $D^+$ for discontinuous
H exist, but, of the closed form limiting distributions
investigated, they are degenerate when H is discrete.
Subroutine DISKS may be modified slightly to obtain an
approximation to the limiting values of k such that
$P(D \ge k/\sqrt{n}) = \alpha$ for any $0 \le \alpha \le 1$.

5. The limiting values of k above were approximated as
described for three distribution families. As n increased,
k had a general tendency to increase and become fairly constant
for n > 50. As the parameter of each family changed such that
H had smaller jumps at mass points and became "smoother" looking,
k approached the limiting value of k found in the standard
K-S tables. Significantly, this convergence of k to the
limiting value for the continuous case was much slower than
anticipated.

6. Figures 1, 2, and 3 indicate that each family of
distributions has distinctive sets of similar curves. Further
investigation seems warranted to attempt to find an easy and
quick means to modify the existing K-S tables for use in a
K-S test when H is discrete. This would involve determining,
for each family of discrete distributions, a function depending
on n, $\alpha$, and the parameters of the family that would modify
the critical values in the standard K-S tables for continuous
H into critical values for that particular family of
distributions.
APPENDIX A



USE OF SUBROUTINE DISKS 



A. PURPOSE OF SUBROUTINE

Subroutine DISKS uses Conover's [3] procedure to compute
the critical level (the probability of getting a value of the
test statistic as large as the observed value when $H_0$:
$F(x) = H(x)$ for all x is true) of a Kolmogorov goodness-of-fit
test when the hypothesized distribution is discrete. If $S_n$
is the cumulative empirical distribution of the sample, then
the following test statistics are used for the specified
alternative hypothesis: (1) alternatives of the type $F \ne H$
use $D = \sup_x |H(x) - S_n(x)|$, (2) alternatives of the type
$F < H$ use $D^- = \sup_x (H(x) - S_n(x))$, while (3) alternatives of
the type $F > H$ use $D^+ = \sup_x (S_n(x) - H(x))$. For a given
hypothesized distribution and sample of the distribution to be tested
the subroutine determines the observed values of $D$, $D^-$, and $D^+$.
If these observed values are $d$, $d^-$, and $d^+$, respectively, then
the subroutine computes the double precision quantities PDMNS,
PDPLS, PDL, and PD where:

PDMNS = Prob($D^- \ge d^-$)

PDPLS = Prob($D^+ \ge d^+$)

PDL $\le$ Prob($D \ge d$) $\le$ PD
B. INPUT TO SUBROUTINE

1. ITYPE = 1

If all of the possible mass points of the hypothesized
distribution are represented in the data, then ITYPE = 1 and
the following quantities must be provided:

X -- N-dimensional vector containing the sample 
H -- (M+l) -dimensional vector containing the values 
of the hypothesized cumulative distribution 
M -- the number of distinct data points 
N -- the total number of data points, less than 
or equal to thirty (30) 

S -- a dummy vector of length (M+l) 

2. ITYPE = 2

If all of the possible mass points of the hypothesized 
distribution are not represented, then ITYPE = 2 and the above 
input is modified by making X a dummy vector and S a vector of 
the values of the cumulative empirical distribution. 

C. LIMITATIONS 

The only limitation to the subroutine is that N be less
than or equal to thirty (30). For N larger than thirty (30),
the user need only modify the second and third dimension
statements of the program by changing 30 to the number desired.
The user should be cautioned that, as N gets large (about one 
hundred (100)), the nature of the calculations causes signifi- 
cant errors to propagate even with double precision calculations. 
D. TIME AND CORE REQUIREMENTS

All of the times and core requirements that follow are
based on runs of DISKS at W. R. Church Computer Center, Naval
Postgraduate School, Monterey, California on an IBM 360/67.
The subroutine requires approximately 11K of core for storage
and 6.5 seconds to compile. Execution time is approximately
.4 seconds for N = 10, .5 seconds for N = 20 and .55 seconds
for N = 30.

E. VERIFICATION

Fifteen examples were used to verify that subroutine DISKS 
calculated the desired quantities correctly. In each example, 
the calculations were performed by hand-calculations using 
Conover's procedure and then compared with the computer-calcu- 
lated values. Examples were formulated to exercise each "if" 
statement and each branching point in the subroutine at various 
levels of M and N. The following are three examples used in 
the verification process and are listed here to indicate the 
general types of examples used: 

1. This is example 1 from Conover [3]. Let H be the
discrete uniform distribution with 5 mass points on the
integers 1, 2, 3, 4, 5. Suppose a random sample of size 10 with
(ordered) values 1, 1, 1, 2, 2, 2, 3, 3, 3, 3 is drawn from
some population. Hand-calculation shows $d^- = 0.0$, $d^+ = .4$,
and $d = .4$, yielding:

$P(D^- \ge d^-) = 1.0$

$P(D^+ \ge d^+) = .02081$

$0.04119 \le P(D \ge d) \le 0.04162$

Subroutine DISKS yielded:

PDMNS = 1.0

PDPLS = 0.020809

PDL = .041184, PD = .041617

2. This example is from Darmosiswoys [5], page 24.
H has mass points 1, 2, and 3 such that P(X = 1) = .3624,
P(X = 2) = .4167, and P(X = 3) = .2209 (X is a function of
an exponential random variable, Y, with parameter 6.0 defined
by X = 1 if $0 < Y \le 2.7$, X = 2 if $2.7 < Y \le 9.09$, and X = 3
if $Y > 9.09$). This is an example of how to handle data that
have been grouped when the original sample cannot be recovered.
A random sample of size 15 with values 1, 2, 3, 2, 3, 3, 1, 1,
2, 1, 3, 3, 1, 3, 3 is drawn from some population.
Hand-calculation yielded:

$.05506 \le P(D \ge d) \le 0.0557$

Subroutine DISKS yielded:

PDL = 0.055174, PD = 0.055817

3. This example illustrates how to handle discrete
distributions with a countable number of mass points. Let H be
the Poisson distribution with parameter 0.7. Suppose a
random sample of size 10 with values 1, 3, 2, 1, 0, 1, 3, 2,
1, 2 is drawn from some population. Hand-calculations
yielded:

$P(D^- \ge d^-) = .014774$
$P(D^+ \ge d^+) = 0.84238$
$0.02316 \le P(D \ge d) \le 0.02386$

Since the number of distinct mass points is infinite, some
value of M must be decided upon to use in the program. H is
truncated such that all the probability associated with mass
points beyond the (M+1)st mass point is assigned to the
(M+1)st mass point with a corresponding grouping of sample
data if necessary. With M = 4, ITYPE = 1 and P(X > 3) = 1 - H(3)
= .0291 is added to P(X = 3). In this case, DISKS yielded:

PDMNS = 0.014768
PDPLS = 1.0
PDL = 0.023152, PD = 0.023277

With M = 6, ITYPE = 2 and P(X > 5) = 1 - H(5) = 0.0001 to four
decimal places. In this case, DISKS yielded:

PDMNS = 0.014772
PDPLS = 0.842311
PDL = 0.023156, PD = 0.02382

The actual hypothesized distribution is a truncated
distribution, but, if the probability of all the mass points beyond
the (M+1)st mass point is relatively small, as in the above
case with M = 6, the critical levels calculated by DISKS are
very good approximations to the critical levels of the
untruncated hypothesized distribution.
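The truncation step can be sketched as follows. This Python fragment is an
illustration only. It assumes one reading of the input convention, namely that
the H vector has M+1 entries, starts at 0, covers the M retained mass points
0, 1, ..., M-1, and ends at 1 so that all mass beyond the last retained point
is lumped onto it. The function name truncated_poisson_h is hypothetical.

```python
import math

def truncated_poisson_h(mu, m):
    """Return (h, tail) for a Poisson(mu) truncated to mass points 0..m-1.

    h has m+1 entries: h[0] = 0, then the CDF at 0..m-2, then 1.0.
    tail is the probability beyond the last retained point, which the
    final entry 1.0 implicitly assigns to that point.
    """
    h = [0.0]
    cumulative = 0.0
    for k in range(m - 1):
        cumulative += math.exp(-mu) * mu ** k / math.factorial(k)
        h.append(cumulative)
    last_mass = math.exp(-mu) * mu ** (m - 1) / math.factorial(m - 1)
    tail = 1.0 - cumulative - last_mass
    h.append(1.0)
    return h, tail

h, tail = truncated_poisson_h(0.7, 6)
print([round(v, 4) for v in h], round(tail, 4))
```

With parameter 0.7 and six retained mass points the lumped tail is about
0.0001, which is why the truncated and untruncated critical levels agree so
closely in that case.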
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II. SUBROUTINE TO COMPUTE CRITICAL LEVELS 



C     ******************************************************************
C     *  SUBROUTINE DISKS (X,H,M,N,ITYPE,S,PDMNS,PDPLS,PDL,PD)
C     *
C     *  SUBROUTINE DISKS COMPUTES THE CRITICAL LEVELS FOR
C     *  THE THREE K-S STATISTICS ACCORDING TO CONOVER'S
C     *  PROCEDURE (JOURNAL OF THE AMERICAN STATISTICAL
C     *  ASSOCIATION, SEPT., 1972, VOL 67, NO. 339, P 591-6)
C     *  WHEN THE HYPOTHESIZED DISTRIBUTION IS DISCRETE.
C     *
C     *  PARAMETERS
C     *     X     - N-DIMENSIONAL VECTOR OF DATA POINTS THAT
C     *             ARE REQUIRED ONLY IF ITYPE = 1
C     *     H     - M+1-DIMENSIONAL VECTOR OF VALUES OF THE
C     *             HYPOTHESIZED CUMULATIVE DISTRIBUTION
C     *             FUNCTION AT EACH DISTINCT VALUE OF X WITH
C     *             H(1) = 0.0 AND H(M+1) = 1.0
C     *     M     - NUMBER OF DISTINCT DATA POINTS
C     *     N     - NUMBER OF DATA POINTS
C     *     ITYPE - 1 IF ALL POSSIBLE MASS POINTS ARE
C     *               REPRESENTED IN THE DATA
C     *             2 IF NOT ALL POSSIBLE MASS POINTS ARE
C     *               REPRESENTED
C     *     S     - VALUES OF THE EMPIRICAL DISTRIBUTION
C     *             FUNCTION AT MASS POINTS.  INPUT ONLY IF
C     *             ITYPE = 2
C     *     PDMNS - DOUBLE PRECISION OUTPUT CRITICAL LEVEL
C     *             FOR D-MINUS
C     *     PDPLS - DOUBLE PRECISION OUTPUT CRITICAL LEVEL
C     *             FOR D-PLUS
C     *     PDL   - DOUBLE PRECISION OUTPUT LOWER BOUND ON
C     *             CRITICAL LEVEL FOR D
C     *     PD    - DOUBLE PRECISION OUTPUT UPPER BOUND ON
C     *             CRITICAL LEVEL FOR D
C     *
C     *  USAGE - REJECT HYPOTHESIS F(X) = H(X) IF PREDE-
C     *          TERMINED CRITICAL LEVEL IS GREATER THAN PD
C     *          REJECT HYPOTHESIS F(X) GREATER THAN H(X)
C     *          IF PREDETERMINED CRITICAL LEVEL IS
C     *          GREATER THAN PDMNS
C     *          REJECT HYPOTHESIS F(X) LESS THAN H(X) IF
C     *          PREDETERMINED CRITICAL LEVEL IS GREATER
C     *          THAN PDPLS
C     ******************************************************************



      SUBROUTINE DISKS (X,H,M,N,ITYPE,S,PDMNS,PDPLS,PDL,PD)
      DIMENSION X(N),H(N),S(N),CO(30,30),J(30),F(30),CD(30)
      DIMENSION B(30),E(30),BD(30),ED(30),C(30),FD(30)
      REAL*8 CO,F,CD,FD,B,E,BD,ED,C,BSUM,ESUM,PDMNS,PDPLS
      REAL*8 PDM,PDP,Y,PDL,PD
      NM1 = N-1
      RN = FLOAT(N)
      DMNS = 0.0
      DPLS = 0.0
      MP1 = M+1
      EPS = 1.E-6
      IF (ITYPE.EQ.2) GO TO 8



C     ******************************************************************
C     *  SORT X'S IN ASCENDING ORDER.  J IS SORTED INDEX
C     ******************************************************************
      DO 1 K1=1,N
      J(K1) = K1
    1 CONTINUE
      DO 3 K2=1,NM1
      IY = K2+1
      DO 2 K3=IY,N
      IF (X(J(K2)).LE.X(J(K3))) GO TO 2
      IDUM = J(K2)
      J(K2) = J(K3)
      J(K3) = IDUM
    2 CONTINUE
    3 CONTINUE



C     ******************************************************************
C     *  COMPUTE EMPIRICAL DISTRIBUTION FUNCTION, S
C     ******************************************************************
      S(1) = 0.0
      SUM = 0.0
      K = 2
      I = 1
    4 IY = I+1
      DO 5 K4=IY,N
      IF (X(J(K4)).GT.X(J(I))) GO TO 6
    5 CONTINUE
    6 I = K4
      SUM = SUM+(K4-IY+1)/RN
      S(K) = SUM
      K = K+1
C     K4 EXCEEDS N WHEN THE LARGEST VALUE IS REPEATED, SO S IS
C     ALREADY COMPLETE (ASSUMES K4 = N+1 AFTER THE LOOP RUNS OUT)
      IF (K4.GT.N) GO TO 8
      IF (K4.EQ.N) GO TO 7
      GO TO 4
    7 S(K) = 1.0

C     ******************************************************************
C     *  COMPUTE DPLS, DMNS, AND D
C     ******************************************************************
    8 DO 9 K17=2,M
      DIFF = H(K17)-S(K17)
      DIFF2 = -DIFF
      IF (DMNS.LT.DIFF) DMNS = DIFF
      IF (DPLS.LT.DIFF2) DPLS = DIFF2
    9 CONTINUE
      D = DMNS
      IF (DPLS.GT.D) D = DPLS
      NMNS = RN*(1.0-DMNS)+0.9999
      NPLS = RN*(1.0-DPLS)+0.9999
      ND = RN*(1.0-D)+0.9999
C     ******************************************************************
C     *  COMPUTE C'S AND F'S
C     ******************************************************************
      NC = 1
      DO 14 K18=1,NMNS
      ORD = DMNS+(K18-1.0)/RN
      DO 10 K19=NC,MP1
      IF (ORD.LT.H(K19)) GO TO 11
   10 CONTINUE
   11 IY = K19-1
      OMH = ORD-H(IY)
      IF (ABS(OMH).LE.EPS) GO TO 12
      C(K18) = 1.0-H(K19)
      GO TO 13
   12 C(K18) = 1.0-ORD
   13 NC = IY
   14 CONTINUE
      NC = 1
      DO 19 K20=1,NPLS
      ORD = 1.0-DPLS-(K20-1.0)/RN
      DO 15 K21=NC,MP1
      NB = MP1-K21+1
      IF (ORD.GT.H(NB)) GO TO 16
   15 CONTINUE
   16 IY = NB+1
      HMO = H(IY)-ORD
      IF (ABS(HMO).LE.EPS) GO TO 17
      F(K20) = H(NB)
      GO TO 18
   17 F(K20) = ORD
   18 NC = MP1-NB
   19 CONTINUE
C     ******************************************************************
C     *  COMPUTE CD'S AND FD'S
C     ******************************************************************
      NC = 1
      DO 24 K22=1,ND
      ORD = D+(K22-1.0)/RN
      DO 20 K23=NC,MP1
      IF (ORD.LT.H(K23)) GO TO 21
   20 CONTINUE
   21 IY = K23-1
      OMH = ORD-H(IY)
      IF (ABS(OMH).LE.EPS) GO TO 22
      CD(K22) = 1.0-H(K23)
      GO TO 23
   22 CD(K22) = 1.0-ORD
   23 NC = IY
   24 CONTINUE
      NC = 1
      DO 29 K24=1,ND
      ORD = 1.0-D-(K24-1.0)/RN
      DO 25 K25=NC,MP1
      NB = MP1-K25+1
      IF (ORD.GT.H(NB)) GO TO 26
   25 CONTINUE
   26 IY = NB+1
      HMO = H(IY)-ORD
      IF (ABS(HMO).LE.EPS) GO TO 27
      FD(K24) = H(NB)
      GO TO 28
   27 FD(K24) = ORD
   28 NC = MP1-NB
   29 CONTINUE



C     ******************************************************************
C     *  COMPUTE CO(I,J), COMBS I-1 TAKEN J-1 AT A TIME
C     ******************************************************************
      NP1 = N+1
      DO 31 I=2,NP1
      CO(I,1) = 1.0
      IM1 = I-1
      DO 30 JJ=2,I
      JM1 = JJ-1
      CO(I,JJ) = (CO(I,JM1)*(I-JJ+1.0))/(JJ-1)
   30 CONTINUE
   31 CONTINUE



C     ******************************************************************
C     *  COMPUTE B'S, E'S, BD'S AND ED'S
C     ******************************************************************
      B(1) = 1.0
      DO 33 K26=2,NMNS
      BSUM = 1.0
      IY = K26-1
      DO 32 K27=1,IY
      BSUM = BSUM-CO(K26,K27)*(C(K27)**(K26-K27))*B(K27)
   32 CONTINUE
      B(K26) = BSUM
   33 CONTINUE
      E(1) = 1.0
      DO 35 K28=2,NPLS
      ESUM = 1.0
      IY = K28-1
      DO 34 K29=1,IY
      ESUM = ESUM-CO(K28,K29)*(F(K29)**(K28-K29))*E(K29)
   34 CONTINUE
      E(K28) = ESUM
   35 CONTINUE
      BD(1) = 1.0
      ED(1) = 1.0
      DO 37 K30=2,ND
      BSUM = 1.0
      ESUM = 1.0
      IY = K30-1
      DO 36 K31=1,IY
      BSUM = BSUM-CO(K30,K31)*(CD(K31)**(K30-K31))*BD(K31)
      ESUM = ESUM-CO(K30,K31)*(FD(K31)**(K30-K31))*ED(K31)
   36 CONTINUE
      BD(K30) = BSUM
      ED(K30) = ESUM
   37 CONTINUE
C     INITIALIZE THE ACCUMULATORS, THEN SUM THE CRITICAL LEVELS
      PDMNS = 0.0
      PDPLS = 0.0
      PDP = 0.0
      PDM = 0.0
      DO 38 K32=1,NMNS
      PDMNS = PDMNS+CO(NP1,K32)*B(K32)*(C(K32)**(N-K32+1))
   38 CONTINUE
      DO 39 K33=1,NPLS
      PDPLS = PDPLS+CO(NP1,K33)*E(K33)*(F(K33)**(N-K33+1))
   39 CONTINUE
      DO 40 K34=1,ND
      IY = N-K34+1
      Y = CO(NP1,K34)
      PDM = PDM+Y*BD(K34)*(CD(K34)**IY)
      PDP = PDP+Y*ED(K34)*(FD(K34)**IY)
   40 CONTINUE
      PD = PDM+PDP
      PDL = PD-PDM*PDP
      RETURN
      END